Skip to content

simd: SimdProfile fine-grained detection + cpu-* pinning + leaf 7,1 CPUID#190

Closed
AdaWorldAPI wants to merge 5 commits into
masterfrom
claude/resolve-pr-369-conflicts-ozMXd
Closed

simd: SimdProfile fine-grained detection + cpu-* pinning + leaf 7,1 CPUID#190
AdaWorldAPI wants to merge 5 commits into
masterfrom
claude/resolve-pr-369-conflicts-ozMXd

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Summary

Four orthogonal additions that complement (don't duplicate) the simd_runtime/ + CpuOps work shipped in PRs #185#187. The parallel session built the runtime-dispatch trampolines and the GCC-codename DTO lookup; this PR adds the fine-grained CPUID detection + silicon-identity enum + compile-time pinning half of the matrix-doc plan.

Layer Where it lives now
Per-op LazyLock<fn> trampolines (vnni_dot_u8_i8, add_mul_f32, casts, matmul) crate::simd_runtime::* (PR #185)
Codename-to-tier DTO lookup (cpu_ops_for_cpu("sapphirerapids")) crate::simd_runtime::cpu_ops (PR #187)
Silicon-identity enum + runtime CPUID detect ladder crate::hpc::simd_profile::* (this PR)
Compile-time profile pinning via cargo features cpu-* features (this PR)

SimdProfile::detect() is strictly more granular than CpuOps-tier selection — it discriminates GraniteRapids vs SapphireRapids (amx_fp16 bit), CooperLake vs IceLakeSp (BF16 / VBMI mutex), TigerLakeU vs IceLakeSp (VP2INTERSECT). The two pieces compose cleanly; simd_profile() could feed CpuOps selection in a future commit if you want, but neither subsumes the other today.

Commits

de52a446 fix(simd):  SimdProfile::detect() consults amx_available() — Risk #3 closure
03b30e54 feat(examples): simd_profile_probe — hardware verification binary
b9a6fa0e feat(simd):  cpu-* cargo features for compile-time SimdProfile pinning
db3669e8 feat(simd):  SimdProfile enum + detect() implements dispatch matrix

What each commit adds

db3669e8SimdProfile enum + detect()

crate::hpc::simd_profile::SimdProfile — 14-variant enum that names silicon generations directly:

pub enum SimdProfile {
    GraniteRapids, SapphireRapids,
    Zen4Avx512, CooperLake,
    TigerLakeU, IceLakeSp, CascadeLake, SkylakeX,
    ArrowLake, HaswellAvx2,
    A76DotProd, A72Fast, A53Baseline,
    Scalar,
}

detect() walks the decision tree from .claude/knowledge/td-simd-cpu-dispatch-matrix.md § "SimdProfile::detect() mapping" lines 271-305 verbatim. Four load-bearing invariants preserved:

  1. GraniteRapids checked before SapphireRapids — GNR is a strict superset; checking SPR first would route GNR silicon as SPR and leave AMX-FP16 unused.
  2. Zen4 vs SPR via amx_tile (both have F+VBMI+BF16+FP16).
  3. CooperLake vs IceLakeSp via mutually-exclusive bits (CPL has BF16 without VBMI; ICX has VBMI without BF16).
  4. TigerLakeU vs IceLakeSp via avx512vp2intersect.

Also adds 3 new SimdCaps fields: avx512fp16, avx512vp2intersect, amx_fp16 — plus the CPUID leaf 7,1 reader (gated on leaf 7,0 EAX max-subleaf ≥ 1). Master's simd_caps.rs currently has no leaf 7,1 read; GNR's AMX-FP16 lives at CPUID.07H.1H:EAX bit 21, so the fix is a prerequisite for GNR dispatch regardless of which detection ladder uses it.

LazyLock-cached simd_profile() accessor.

b9a6fa0ecpu-* cargo features (compile-time pinning)

13 mutually-exclusive features that fold simd_profile() to a const at compile time and elide the LazyLock entirely:

cpu-gnr  cpu-spr  cpu-zen4  cpu-cpl  cpu-tigerlake  cpu-icx  cpu-clx  cpu-skx
cpu-arrowlake  cpu-haswell  cpu-a76  cpu-a72  cpu-a53

Mutual exclusion enforced by a const assert (_PIN_COUNT <= 1); enabling two simultaneously produces a build error. Pair with -Ctarget-cpu=<codename> for the codegen side. Use case: per-silicon distribution where --features cpu-spr produces an SPR-specialized binary that bypasses runtime detection.

Complementary to runtime-dispatch (PR #185): runtime-dispatch ships one binary that adapts across silicon; cpu-* ships per-silicon binaries that fold dispatch out entirely.

03b30e54examples/simd_profile_probe

Boot-on-silicon diagnostic per the matrix doc's TEST-promotion checklist § step 1: "Boot the binary on the silicon and confirm simd_profile() returns the expected variant." Prints:

  • Resolved SimdProfile variant + arch/family flags
  • Compile-time pinning status (with pinned variant if active)
  • Every CPUID-derived SimdCaps bit
  • ARM heuristic profile (when running aarch64)
  • Active compile-time target features
  • Per-variant matrix-doc cell summary
  • Re-runs the pinning_consistency invariant so regressions in the cfg cascade are caught post-deploy
cargo run --example simd_profile_probe --release                 # runtime
cargo run --example simd_profile_probe --release --features cpu-spr   # pinned

de52a446 — Risk #3 closure (AMX OS-state gating)

SimdProfile::detect() now consults simd_amx::amx_available() (CPUID + OSXSAVE + XCR0[17,18] + arch_prctl(XCOMP_PERM, 18) on Linux 5.19+) and demotes when CPUID reports AMX but the OS/hypervisor hasn't enabled the tile XSAVE state.

Verified on this Sapphire Rapids build host: CPUID reports amx_tile=1, amx_bf16=1, but amx_available() returns false (hypervisor masks XCR0 bits 17/18 or the prctl request fails). Without the fix, detect() would resolve as SapphireRapids and dispatch tables would route to AMX kernels that SIGILL. With the fix, it demotes to Zen4Avx512 (AVX-512 BF16/FP16 path), matching what amx_available() consumers already do inline.

Probe surfaces the CPUID-vs-OS gap directly so the demotion is visible without reading source.

Diff vs current master (eb6444f)

 Cargo.toml                     |  30 +++
 examples/simd_profile_probe.rs | 202 ++++++++++++++
 src/hpc/mod.rs                 |   3 +
 src/hpc/simd_caps.rs           |  99 ++++++-
 src/hpc/simd_profile.rs        | 597 +++++++++++++++++++++++++++++++++++++++++
 src/simd.rs                    |  16 ++
 6 files changed, 943 insertions(+), 4 deletions(-)

Test plan

How it composes with PRs #185#187

                ┌─ crate::simd_runtime::* (PR #185)  ←  runtime LazyLock trampolines
                │
caps = simd_caps()
                │
                ├─ crate::simd_runtime::cpu_ops (PR #187)  ←  named-target lookup
                │
                └─ crate::hpc::simd_profile::detect() (THIS PR)  ←  CPUID-based 14-variant detect
                                                                    + amx_available() OS-gate
                                                                    + cpu-* pinning constants

The matrix doc's TEST-promotion workflow needs the fine-grained variant (you can't promote "SPR" cells if your detector only says "amx_int8 tier"). SimdProfile is that surface; PR #187's CpuOps is the dispatch-table glue.

Out of scope (separate PRs)

  • Wiring simd_profile() as the seed for CpuOps selection (would replace the current is_x86_feature_detected! inside cpu_ops()).
  • Promoting matrix-doc cells DOCTEST using the probe binary on real SPR/GNR/Zen4 silicon.
  • A simd_profile().coarse() -> &'static str codename mapping for parity with cpu_ops_for_cpu().

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS


Generated by Claude Code

claude added 4 commits May 21, 2026 11:07
Phase 3 T3.1 of the SIMD integration plan: introduce
crate::hpc::simd_profile::SimdProfile, the silicon-grained dispatch
identity that replaces the coarse three-Tier collapse called out in
audit findings TD-T12/T13/T14.

The decision tree in SimdProfile::detect() implements
.claude/knowledge/td-simd-cpu-dispatch-matrix.md lines 271-305
verbatim, preserving the four load-bearing invariants from the
"Detection invariants" section: GraniteRapids-before-SapphireRapids,
Zen4-vs-SPR via amx_tile, CooperLake-vs-IceLakeSp via the mutually
exclusive BF16/VBMI bit pattern, and TigerLakeU-vs-IceLakeSp via
VP2INTERSECT.

Risk #4 of the integration plan (no GNR detection without leaf 7,1
reader) closed in the same change: SimdCaps gains avx512fp16,
avx512vp2intersect, and amx_fp16 fields, with the x86 detect() arm
adding a __cpuid_count(7, 1) read gated on the leaf 7,0 EAX max
subleaf advertising support. has_amx_fp16() requires amx_tile in
addition to the FP16 bit, mirroring the defense-in-depth pattern in
simd_amx::amx_available().

Surface follows cognitive-shader-foundation.md: SimdProfile +
simd_profile() re-exported through crate::simd::* so consumers
import a single public path. The existing private Tier / tier()
machinery in src/simd.rs is untouched; this lands alongside, with
incremental migration deferred to T3.5/T3.6.

Tests: 7 new in simd_profile (detection determinism, arch
partitioning, AVX-512 subset invariant, x86_64-only Scalar
fallback, name uniqueness), 2 new in simd_caps (FP16 fields false
on non-x86, has_amx_fp16 requires amx_tile). 2075/2075 lib tests
pass, clippy -D warnings clean.
Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin
simd_profile() to a const at compile time, bypassing the runtime
LazyLock detection from T3.1. One feature per non-Scalar variant of
the SimdProfile enum.

Features (mapping to LLVM target-cpu codenames):
  cpu-gnr         → GraniteRapids    (graniterapids)
  cpu-spr         → SapphireRapids   (sapphirerapids)
  cpu-zen4        → Zen4Avx512       (znver4)
  cpu-cpl         → CooperLake       (cooperlake)
  cpu-tigerlake   → TigerLakeU       (tigerlake)
  cpu-icx         → IceLakeSp        (icelake-server)
  cpu-clx         → CascadeLake      (cascadelake)
  cpu-skx         → SkylakeX         (skylake-avx512)
  cpu-arrowlake   → ArrowLake        (arrowlake)
  cpu-haswell     → HaswellAvx2      (haswell)
  cpu-a76         → A76DotProd       (cortex-a76)
  cpu-a72         → A72Fast          (cortex-a72)
  cpu-a53         → A53Baseline      (cortex-a53)

Mutual exclusion (per integration-plan risk #1: "if a user sets
cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on
AMX instructions") is enforced via a const assert: each cpu-*
contributes 1 to _PIN_COUNT and the assert fires at compile time if
the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4
simultaneously produces a build error citing the mutex.

Implementation: a const pinned_profile() -> Option<SimdProfile>
walks a cfg cascade and returns the active variant or None. The
simd_profile() function exists in two cfg-gated forms — a runtime
LazyLock variant compiled when no cpu-* feature is set, and a
const-foldable variant compiled when any is set. The LazyLock is
not linked into pinned binaries.

is_pinned() const helper exposes whether compile-time dispatch is
active, useful both for consumer-facing diagnostics and for gating
arch-detection tests that no longer apply when pinning overrides
hardware. Existing x86_target_lands_inside_x86_family /
aarch64_target_lands_inside_aarch64_family tests early-return when
pinned; two new tests (pinning_default_is_off, pinning_consistency)
verify the const + runtime paths agree.

Tests: 2077/2077 lib tests pass under both default and --features
cpu-spr configurations. cargo clippy --lib -- -D warnings clean in
both modes. Mutex compile_error verified by attempting
--features "cpu-spr,cpu-zen4" — fails as expected with the const
assert citation. The default (no feature) path is byte-identical to
the T3.1 runtime detection — no regression risk to the merged #181
work or to the e40f3a3 SimdProfile commit.
Supports step 1 of the TEST-promotion checklist from
.claude/knowledge/td-simd-cpu-dispatch-matrix.md § "TEST verification
checklist": "Boot the binary on the silicon and confirm
simd_profile() returns the expected variant."

The probe prints:
  - Resolved SimdProfile variant + arch/family flags
  - Compile-time pinning status (and pinned variant if active)
  - Every CPUID-derived SimdCaps bit, ticked/unticked
  - ARM heuristic profile when running on aarch64
  - Active compile-time target features (avx512f / avx2)
  - Per-variant matrix-doc cell summary (terse — matrix is source of truth)
  - Runs the same pinning_consistency invariant the unit test checks,
    so a probe deployed on real silicon flags regressions in the cfg
    cascade.

First-hardware results on the build host (Sapphire Rapids):
  - simd_profile() resolves to SapphireRapids ✓
  - amx_tile + amx_bf16 + avx512fp16 all set ✓
  - amx_fp16 unset (correctly NOT promoting to GraniteRapids) ✓
  - GNR-before-SPR ordering invariant verified end-to-end

This is the first end-to-end pass of the e40f3a3 SimdProfile detect
chain plus the 5a3a663 cpu-* pinning machinery on real silicon —
confirms the dispatch axis is functional, not just doc-checked.

Smoke-tested with --features cpu-zen4: probe correctly reports
ACTIVE pinning and Zen4Avx512 as the resolved variant, even on SPR
silicon (pinning intentionally overrides hardware detection).
…losure

Integration plan risk #3 ("Detection robustness across hypervisors"):
CPUID may advertise AMX-TILE while the OS/hypervisor has not enabled
the tile XSAVE state. Without the OS-level check, the dispatch table
routes to AMX kernels that SIGILL at first use.

Fix: SimdProfile::detect() now reads `simd_amx::amx_available()` (the
existing 4-step gate: CPUID → OSXSAVE → XCR0[17,18] → arch_prctl
XCOMP_PERM on Linux 5.19+) and demotes when CPUID and OS disagree.
The GraniteRapids and SapphireRapids arms now require both the CPUID
bits AND `amx_usable`; the Zen4Avx512 arm catches SPR-class CPUID
with locked-down hypervisor XSAVE so dispatch falls to the AVX-512
BF16/FP16 path instead.

Verified on the build host (Sapphire Rapids silicon, kernel 6.18.5):
  - CPUID reports amx_tile=1, amx_int8=1, amx_bf16=1 (all true)
  - simd_amx::amx_available() returns false (hypervisor masks
    XCR0[17,18] or the arch_prctl(XCOMP_PERM) request fails)
  - SimdProfile::detect() correctly resolves to Zen4Avx512, not
    SapphireRapids — the AMX kernels are not reachable from
    dispatch on this OS state.

Without this fix, the e40f3a3 detect path would have resolved to
SapphireRapids on this exact silicon/OS combination, then SIGILL'd
the first time a dispatch table called an AMX kernel. Bug closed
before any consumer was wired to the dispatch table.

The probe binary (examples/simd_profile_probe.rs) gains a new "AMX
gating (CPUID vs OS)" section so the CPUID-vs-OS gap is visible
without reading source. Format mirrors how the matrix-doc cell
summary appears: terse, two lines plus an optional demotion note
when the bits disagree.

Pinned mode (cpu-* cargo features) intentionally bypasses this gate
since pinning is a build-time assertion that the target OS supports
the chosen variant — pinned binaries are non-portable by design.

Tests: 2077/2077 lib pass. cargo clippy --lib clean under default
and --features cpu-spr. Behaviour on hardware with proper AMX
enablement (full prctl path success) is unchanged: SapphireRapids
still resolves to SapphireRapids when amx_available() returns true.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: de52a446bd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/hpc/simd_profile.rs
Comment on lines +325 to +327
#[cfg(feature = "cpu-a76")]
{
return Some(SimdProfile::A76DotProd);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard cpu- pinning by target architecture*

The new pinning map accepts ARM pins on x86_64 (and x86 pins on aarch64) with no compile-time guard, so --features cpu-a76 on an x86_64 build will still compile and force simd_profile() to report A76DotProd. That makes is_x86()/is_aarch64() reflect the feature flag rather than the actual target and can silently route callers into the wrong dispatch family if a build matrix or release script passes the wrong cpu-* feature. Please add compile_error!/#[cfg] checks so incompatible pin features fail fast for the current target_arch.

Useful? React with 👍 / 👎.

…ample

Two independent bug fixes for codex P2 on de52a44 and the
no-default-features CI failure:

1) codex P2 — Reject ARM cpu-* on x86 builds (and vice versa).
   Without target_arch guards, `--features cpu-a76` on an x86_64
   build silently routes simd_profile() to A76DotProd, breaking
   the is_x86()/is_aarch64() partitioning and routing callers
   into the wrong dispatch family. Add compile_error! checks
   that fail fast for the current target_arch — same fail-fast
   pattern as the existing _PIN_COUNT mutual-exclusion assert.

2) CI fix — examples/simd_profile_probe.rs uses ndarray::hpc::*
   and ndarray::simd_amx::* which are gated behind the "std"
   feature. On `cargo test --no-default-features` the example
   target still tries to compile, producing E0433 "cannot find
   hpc in ndarray". Add `required-features = ["std"]` to the
   example's Cargo.toml entry so it is skipped when std is
   disabled, matching the existing pattern for ocr_benchmark.

No behavioral change on default builds. Both fixes are
independent of the architectural question about whether cpu-*
features should exist at all (which is for the originating
session to revisit if they want to unify with cpu_ops_for_cpu);
this just makes the existing features less of a footgun.
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
… AMX OS-gate in cpu_ops

Salvages the detection-only subset of closed PR #190 — three real
gaps in the substrate runtime dispatch without inheriting any of
PR #190's consumer-facing additions (no SimdProfile enum, no
public dispatch-identity API, no cpu-* features).

What lands here:

1) CPUID leaf 7,1 read for AMX-FP16 (CPUID.07H.1H:EAX bit 21).
   Lives on a different subleaf than the existing AMX bits;
   GraniteRapids is the only silicon advertising it today.
   Guarded by leaf 7,0 EAX >= 1 so older CPUs that don't expose
   subleaf 1 stay correct.

2) Three new SimdCaps fields (additive, all default false on
   non-x86):
   - avx512fp16  — CPUID.07H.0H:EDX bit 23 — `__m512h` math.
                   Discriminates SPR-class from CascadeLake/
                   IceLakeSp/SkylakeX for any future FP16 kernel.
   - avx512vp2intersect — CPUID.07H.0H:EDX bit 8 — TigerLake
                   mobile only; absent from Ice Lake-SP and every
                   later server part. Exposed for completeness.
   - amx_fp16     — CPUID.07H.1H:EAX bit 21 — Granite Rapids.

   Plus convenience methods has_avx512_fp16() and has_amx_fp16()
   (the latter defense-in-depths the amx_tile bit).

3) AMX OS-state gate in cpu_ops() selection. The CPU-reports-AMX
   path now AND-gates on `simd_amx::amx_available()` which runs
   the full four-step check (CPUID + OSXSAVE + XCR0[17,18] +
   arch_prctl(XCOMP_PERM, 18) on Linux 5.19+). This closes the
   SIGILL hole when a hypervisor masks XCR0 or the OS hasn't
   honoured the prctl: previously cpu_ops() would route to
   CPU_OPS_AMX_INT8 and AMX instructions would SIGILL despite
   the CPUID bit. Now it demotes to CPU_OPS_AVX512_VNNI cleanly.

What's deliberately NOT here (rejected from PR #190):

- No `SimdProfile` enum — would expose dispatch identity to
  consumer code and invite `match profile { ... }` arms that
  defeat the polyfill contract.
- No `cpu-*` cargo features — build-time silicon pinning that
  defeats polyfill at an earlier binding time.
- No `simd_profile_probe` example — diagnostic-only, rebuilds
  the SimdProfile surface this PR doesn't bring.
- No public dispatch-identity API at any layer. The new bits
  are internal substrate detection; consumers continue to use
  `crate::simd::*` polyfilled types and `crate::simd_runtime::*`
  per-op trampolines.

The new fields slot into existing `cpu_ops()` selection by
extension (e.g. a future AMX-FP16 tier would AND-gate on
`caps.amx_fp16 && simd_amx::amx_available()` between the
AMX-INT8 and AVX-512-VNNI arms). No selection logic uses them
yet — they're laying the runway, not consuming it.

Tests:
- 4 new simd_caps tests: cpuid_extended_bits_smoke,
  has_amx_fp16_requires_amx_tile,
  x86_extended_bits_are_false_on_non_x86, plus extended
  determinism coverage.
- All 6 existing cpu_ops tests still pass; the AMX OS-gate
  change passes through transparently on hosts where
  amx_available() agrees with CPUID (the typical case).
- fmt + clippy clean on `--features runtime-dispatch`.
AdaWorldAPI added a commit that referenced this pull request May 21, 2026
feat(simd_caps): CPUID 7,1 + new x86 caps fields + AMX OS-gate in cpu_ops (salvage from #190)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants